[Linkpost] “Eliciting secret knowledge from language models” by Arthur Conmy, Bartosz Cywiński, Sam Marks

Update: 2025-10-03

Description

This is a link post.

TL;DR: We study secret elicitation: discovering knowledge that an AI has but doesn’t explicitly verbalize. To that end, we fine-tune LLMs to have specific knowledge that they can apply downstream but deny having when asked directly. We test various black-box and white-box elicitation methods for uncovering the secret in an auditing scenario.

See our X thread and full paper for details.

Training and auditing a model with secret knowledge. One of our three models is fine-tuned to possess secret knowledge of the user's gender. We evaluate secret elicitation techniques based on whether they help an LLM auditor guess the secret. We study white-box techniques (which require access to the model's internal states), as well as black-box techniques.
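As a rough illustration of the evaluation idea described above (scoring an elicitation technique by whether it helps the auditor guess the secret), here is a minimal sketch. It is not the paper's code: the function name `auditor_success_rate`, the exact-match scoring rule, and the toy guesses are all assumptions made for illustration only.

```python
from typing import Sequence


def auditor_success_rate(guesses: Sequence[str], secrets: Sequence[str]) -> float:
    """Return the fraction of cases where the auditor's guess matches the secret.

    Hypothetical scoring rule: case-insensitive exact match after trimming
    whitespace. The paper may score guesses differently.
    """
    if len(guesses) != len(secrets):
        raise ValueError("guesses and secrets must have the same length")
    if not secrets:
        raise ValueError("need at least one audited case")
    correct = sum(
        g.strip().lower() == s.strip().lower() for g, s in zip(guesses, secrets)
    )
    return correct / len(secrets)


# Toy, made-up data for the user-gender setting (not results from the paper).
secrets = ["female", "male", "female", "male"]
baseline_guesses = ["male", "male", "female", "female"]    # auditor without elicitation
elicited_guesses = ["female", "male", "female", "female"]  # auditor with an elicitation technique

baseline_rate = auditor_success_rate(baseline_guesses, secrets)
elicited_rate = auditor_success_rate(elicited_guesses, secrets)

# A technique counts as helpful here if it raises the auditor's success rate.
improvement = elicited_rate - baseline_rate
print(f"baseline={baseline_rate:.2f} elicited={elicited_rate:.2f} gain={improvement:.2f}")
```

Under this assumed setup, a technique is judged by the gain in auditor accuracy over a baseline in which the auditor has no elicitation technique to draw on.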

Summary

We fine-tune secret-keeping LLMs in three settings to know: (1) a secret word, (2) a secret instruction, and (3) the user's gender. Models are trained to apply this secret [...]

---

Outline:

(01:05) Summary

(02:24) Introduction

---

First published:

October 2nd, 2025

Source:

https://www.lesswrong.com/posts/Mv3yg7wMXfns3NPaz/eliciting-secret-knowledge-from-language-models-1

Linkpost URL:
https://arxiv.org/abs/2510.01070

---

Narrated by TYPE III AUDIO.


